PLSC30500, Fall 2024

Part 4. Regression (part a)

Andy Eggers

Regression motivation

It’s great if your estimator is as simple as a sample mean \(\overline{X}\) or a difference in sample means \(\overline{Y}_1 - \overline{Y}_0\).

But \(>95\%\) of the time researchers in social science use OLS regression:

  • predict \(Y\) using many predictors \(X_{[1]}, X_{[2]}, \ldots\)
  • relationship between \(Y\) and \(X_{[1]}\) controlling for or adjusting for other variables
  • or just a difference in sample means! \(\overline{Y}_1 - \overline{Y}_0\)

To understand it we’ll use (almost) everything we have learned.

Regression theory

BLP with one predictor

  • CEF (conditional expectation function): \({\textrm E}[Y \mid X]\). Could be any shape.
  • BLP (best linear predictor): MSE-minimizing slope and intercept, i.e. \[(\alpha, \beta) = \underset{(a,b) \in \mathbb{R}^2}{\arg\min} \, \mathrm{E}\,[\left(Y - (a + bX)\right)^2]\]
  • Can be viewed as BLP of \(Y\) or BLP of CEF (\({\textrm E}[Y \mid X]\))
  • Solution: \(\beta = \frac{\text{Cov}[X, Y]}{{\textrm V}[X]}\), \(\alpha = {\textrm E}[Y] - \beta {\textrm E}[X]\)

Plug-in estimator: to get \(\hat{\alpha}\), \(\hat{\beta}\), plug in sample covariance, sample variance, sample means to solution above
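A minimal sketch of the plug-in estimator, using simulated data (not the course dataset), checked against lm():

```r
# Plug-in BLP estimates on simulated data (illustrative; not the lecture data)
set.seed(60637)
x <- rnorm(1000)
y <- 1 + 2 * x + rnorm(1000)

beta_hat  <- cov(x, y) / var(x)            # sample analog of Cov[X, Y] / V[X]
alpha_hat <- mean(y) - beta_hat * mean(x)  # sample analog of E[Y] - beta E[X]

c(alpha_hat, beta_hat)
coef(lm(y ~ x))  # same numbers: OLS with one predictor is the plug-in estimator
```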

Errors and residuals

Given a prediction \(\hat{Y}_i\) for an observation \(i\), the residual is

\[e_i = Y_i - \hat{Y}_i\]

If the prediction \(\hat{Y}_i\) is based on a true population predictor (e.g. CEF or BLP), then \(e_i = Y_i - \hat{Y}_i\) can be called an error.


(Residual : error :: Sample mean \(\overline{X}\) : expectation \({\textrm E}[X]\))

BLP with multiple predictors

  • CEF (conditional expectation function): \(E[Y \mid \mathbf{X}]\). Could be any shape.
  • BLP (best linear predictor): MSE-minimizing vector of coefficients, i.e.  \[(\beta_0, \beta_1, \ldots, \beta_K) = \underset{(b_0, b_1, \ldots, b_K) \in \mathbb{R}^{K+1}}{\arg\min} \, \mathrm{E}\,[\left(Y - (b_0 + b_1X_{[1]} + \ldots + b_K X_{[K]})\right)^2]\]
  • Can be viewed as BLP of \(Y\) or BLP of CEF (\(E[Y \mid \mathbf{X}]\))
  • Solution: no simpler statement than the problem above

Plug-in estimator (Ordinary Least Squares, OLS):

\[(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_K}) = \underset{(b_0, b_1, \ldots, b_K) \in \mathbb{R}^{K+1}}{\arg\min} \, \frac{1}{n}\sum_{i =1}^n\, \left(Y_i - (b_0 + b_1X_{[1]i} + \ldots + b_K X_{[K]i})\right)^2,\] i.e. choose coefficients to minimize the mean of the squared residuals in the sample

How do we solve this problem?

Our problem is \[(\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_K}) = \underset{(b_0, b_1, \ldots, b_K) \in \mathbb{R}^{K+1}}{\arg\min} \, \frac{1}{n}\sum_{i =1}^n\,\left(Y_i - (b_0 + b_1X_{[1]i} + \ldots + b_K X_{[K]i})\right)^2.\]

We could try all possible vectors of coefficients to find \((\hat{\beta_0}, \hat{\beta_1}, \ldots, \hat{\beta_K})\).

The solution (via linear algebra and calculus) is \[\hat{\boldsymbol{\beta}} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbf{Y}\]

where \(\mathbb{X}\) is an \(n \times (K+1)\) matrix of predictors (including a column of 1s for the intercept) and \(\mathbf{Y}\) is an \(n \times 1\) column vector (the dependent variable).
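A sketch of this closed-form solution in R, on simulated data (variable names are illustrative):

```r
# OLS via the matrix solution (X'X)^{-1} X'Y, checked against lm()
set.seed(60637)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)  # n x (K+1) design matrix; the column of 1s gives the intercept
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

drop(beta_hat)
coef(lm(y ~ x1 + x2))  # identical coefficients
```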

We’ll discuss this more later.

For now we use lm() or estimatr::lm_robust() in R.

Regressions and their output

Running the regression

Using CCES data,

dat <- read.csv("./../data/cces_2012_subset.csv")
dat$env[dat$env == 6] <- NA
use <- !is.na(dat$env) & !is.na(dat$aa) & !is.na(dat$gender)
dat <- dat[use, ]
dat$female <- ifelse(dat$gender == 2, 1, 0)

Regressing env (priority for jobs over environment, 1-5) on aa (opposition to affirmative action, 1-4) and female (0-1):

(my_reg <- lm(env ~ aa + female, data = dat))

Call:
lm(formula = env ~ aa + female, data = dat)

Coefficients:
(Intercept)           aa       female  
    1.10335      0.68121     -0.00335  

The result is an estimate of a BLP of the form \[\text{Env}_i = \beta_0 + \beta_1 \text{AffAct}_i + \beta_2 \text{Female}_i\]

Accessing/interpreting regression results

summary(my_reg)

Call:
lm(formula = env ~ aa + female, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8282 -0.8248  0.1718  0.8564  3.2188 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.10335    0.03735  29.545   <2e-16 ***
aa           0.68121    0.01118  60.930   <2e-16 ***
female      -0.00335    0.02319  -0.145    0.885    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.095 on 9179 degrees of freedom
Multiple R-squared:  0.2905,    Adjusted R-squared:  0.2903 
F-statistic:  1879 on 2 and 9179 DF,  p-value: < 2.2e-16

Presenting regression results

my_reg_0 <- lm(env ~ aa, data = dat)
my_reg_1 <- lm(env ~ female, data = dat)
modelsummary::modelsummary(list(my_reg_0, my_reg_1, my_reg), 
                           stars = T,
                           gof_map = c("r.squared", "nobs"))
              (1)        (2)        (3)
(Intercept)   1.101***   3.181***   1.103***
              (0.035)    (0.018)    (0.037)
aa            0.681***              0.681***
              (0.011)               (0.011)
female                   -0.156***  -0.003
                         (0.027)    (0.023)
R2            0.290      0.004      0.291
Num.Obs.      9182       9182       9182
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Accessing regression results

coef(my_reg)
 (Intercept)           aa       female 
 1.103350634  0.681206314 -0.003350409 
nobs(my_reg)
[1] 9182
names(my_reg)
 [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"        
my_reg$coefficients
 (Intercept)           aa       female 
 1.103350634  0.681206314 -0.003350409 
head(my_reg$residuals)
         1          2          3          4          5          6 
 0.1718241 -0.7812065  1.1751745  1.2187935 -1.1469696 -1.4657633 
names(summary(my_reg))
 [1] "call"          "terms"         "residuals"     "coefficients"  "aliased"       "sigma"         "df"            "r.squared"     "adj.r.squared" "fstatistic"    "cov.unscaled" 
summary(my_reg)$coefficients
                Estimate Std. Error    t value      Pr(>|t|)
(Intercept)  1.103350634 0.03734519 29.5446489 2.425620e-183
aa           0.681206314 0.01118021 60.9296538  0.000000e+00
female      -0.003350409 0.02318507 -0.1445072  8.851031e-01

\(R^2\)

The \(R^2\) measures how predictive the predictors are (in-sample):

  • \(R^2 = 0\): not at all predictive, useless
  • \(R^2 = 1\): perfectly predictive

Let \(e_i = Y_i - \hat{Y}_i\) be the residual for \(i\) from the model.

Let \(\varepsilon_i = Y_i - \overline{Y}\) be the residual for \(i\) from a null model – no predictors.

“By what proportion does the mean squared residual go down (compared to null model) when we include our predictors?”

\[\begin{align}R^2 &= \frac{\frac{1}{n} \sum_{i = 1}^n \varepsilon_i^2 - \frac{1}{n} \sum_{i = 1}^n e_i^2}{\frac{1}{n} \sum_{i = 1}^n \varepsilon_i^2} \\ &= 1 - \frac{\sum_{i = 1}^n e_i^2}{\sum_{i = 1}^n \varepsilon_i^2}\end{align}\]

\(R^2\) (2)

summary(my_reg)$r.squared
[1] 0.2905007
1 - sum(my_reg$residuals^2)/sum((dat$env - mean(dat$env))^2)
[1] 0.2905007

\(R^2\) as “variance explained”:

1 - var(my_reg$residuals)/var(dat$env)
[1] 0.2905007

The BLP and regression specifications

BLP as “modest” justification for OLS

There is a CEF, but we’re not claiming to be able to find it (cf. the Gauss-Markov theorem, which assumes the CEF is linear).

Instead, trying to estimate a BLP, whose form we choose by deciding what \(X_{[1]}, X_{[2]}, \ldots, X_{[K]}\) are (what variables, what transformations, what interactions).

The BLP is the “best” approximation to \(Y\) and the CEF given the chosen predictors, and OLS is a consistent estimator of it, but the BLP might not be helpful!


So how do we choose a BLP, i.e. a specification?

BLP and estimands

When does the BLP tell us something useful?

  • Predict \(Y\): BLP is a predictive model (e.g. infant mortality in a region predicted using data on water quality and electrification)
  • Describe relationship between \(X\) and \(Y\) with
    • a BLP coefficient, with or without “controlling for” other factors (e.g. linear relationship between water quality and infant mortality, adjusting for electrification)
    • a function of BLP coefficients and sample features (e.g. average partial derivative of infant mortality with respect to water quality, adjusting for electrification)
  • Estimate causal estimands relating to effect of treatment \(D\) on outcome \(Y\) with
    • a regression coefficient (e.g. difference in means in experimental data)
    • predictive models for potential outcomes \(Y(1)\) and \(Y(0)\)

Don’t be fooled by the L in BLP

In the simple case with continuous \(X\) and \(Y\), our BLP was a line: \(\alpha + \beta X\).

The BLP is linear in the sense that it is the weighted sum of components \(X_{[1]}, X_{[2]}, \ldots, X_{[K]}\):

\[\beta_0 + \beta_1 X_{[1]} + \beta_2 X_{[2]} + \ldots + \beta_K X_{[K]}\]

But the BLP does not need to be a linear function of any particular variable, e.g. GDPPC or education. For example:

  • polynomials: \(X_{[1]}\) could be \(\text{GDPPC}\), \(X_{[2]}\) could be \(\text{GDPPC}^2\)
  • binning: \(X_{[1]}\) could be 1 if \(\text{GDPPC} \in [1000, 2000)\) and otherwise 0, \(X_{[2]}\) could be 1 if \(\text{GDPPC} \in [2000, 3000)\) and otherwise 0, etc
  • interactions: \(X_{[1]}\) could be \(\text{GDPPC}\), \(X_{[2]}\) could be \(\text{area}\), \(X_{[3]}\) could be \(\text{GDPPC} \times \text{area}\)
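For instance, binning can be encoded with cut(); the data and cutoffs below are invented for illustration:

```r
# Binned predictor via cut(): one indicator per non-omitted bin (illustrative data)
set.seed(60637)
gdppc <- runif(300, 500, 5000)
y <- log(gdppc) + rnorm(300)

bins <- cut(gdppc, breaks = c(0, 1000, 2000, 3000, Inf), right = FALSE)
coef(lm(y ~ bins))  # an intercept (omitted bin) plus one coefficient per other bin
```

The fitted BLP here is a step function of GDPPC, even though it is linear in the bin indicators.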

Categorical predictors and interactions

Butler & Broockman (2011)

Butler & Broockman conducted an audit experiment where they emailed about 6000 US state legislators asking for help with registering to vote.

Key variables:

  • treat_deshawn (1 if email from “Deshawn Jackson”, 0 if email from “Jake Mueller”)
  • reply_atall (1 if legislator responded, 0 otherwise)
  • leg_R (1 if legislator Republican, 0 otherwise)

mean(bb$reply_atall[bb$treat_deshawn == 1])
[1] 0.556425
mean(bb$reply_atall[bb$treat_deshawn == 0])
[1] 0.5742493
bb |> 
  group_by(treat_deshawn) |> 
  summarize(mean(reply_atall))
# A tibble: 2 × 2
  treat_deshawn `mean(reply_atall)`
          <dbl>               <dbl>
1             0               0.574
2             1               0.556

Analyzing B&B with regression

lm(reply_atall ~ treat_deshawn, data = bb) 

Call:
lm(formula = reply_atall ~ treat_deshawn, data = bb)

Coefficients:
  (Intercept)  treat_deshawn  
      0.57425       -0.01782  

What is the reply rate to emails

  • from Jake Mueller?
  • from Deshawn Jackson?

With a single binary \(X\), the BLP \(\alpha + \beta X\) is also the CEF (two expectations).

Analyzing B&B with regression (2)

We can make R give us the two sample means in various ways:

lm(reply_atall ~ treat_deshawn, data = bb) |> coef()
  (Intercept) treat_deshawn 
   0.57424928   -0.01782424 
bb$treatment <- ifelse(bb$treat_deshawn == 1, "Deshawn", "Jake")
lm(reply_atall ~ treatment, data = bb) |> coef()
  (Intercept) treatmentJake 
   0.55642504    0.01782424 
lm(reply_atall ~ I(treatment == "Deshawn"), data = bb) |> coef()
                  (Intercept) I(treatment == "Deshawn")TRUE 
                   0.57424928                   -0.01782424 
# no intercept via -1
lm(reply_atall ~ treatment - 1, data = bb) |> coef()
treatmentDeshawn    treatmentJake 
       0.5564250        0.5742493 
bb$treatment_f <- factor(bb$treatment, levels = c("Jake", "Deshawn"))
lm(reply_atall ~ treatment_f, data = bb) |> coef()
       (Intercept) treatment_fDeshawn 
        0.57424928        -0.01782424 

Takeaways on specifying categorical regressions

  • if we regress on a character or factor variable, R omits one category; each other category gets a coefficient indicating difference from the omitted category
  • you can extract the predictions for each group from the coefficients, regardless of the omitted category
  • you can also choose which category is omitted (e.g. by setting factor levels), and one choice can be more useful than another

Adding a predictor

Recall leg_R indicates whether the email recipient was a Republican.

lm(reply_atall ~ treat_deshawn + leg_R, data = bb) |>  
  coef()
  (Intercept) treat_deshawn         leg_R 
   0.54662798   -0.01775660    0.06177313 

If we write the prediction equation as \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 \mathrm{DeShawn}_i + \hat{\beta}_2 \mathrm{LegRep}_i,\) then we have

Email        Recipient   Predicted response rate
Jake         Dem         \(\hat{\beta}_0\)
DeShawn      Dem         \(\hat{\beta}_0 + \hat{\beta}_1\)
Jake         Rep         \(\hat{\beta}_0 + \hat{\beta}_2\)
DeShawn      Rep         \(\hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2\)

So what is the predicted DeShawn-vs-Jake difference for Democrats? For Republicans?

Comparing predictions to actual means

To make the predictions match the actual means (and to allow the effect of treatment to vary by the legislator’s party): add an interaction.

Motivating interactions

We start with

\[Y_i = a_0 + a_1 X_i + a_2 D_i\]

Let’s let \(a_2\) depend (linearly) on the value of \(X_i\)!

\[a_2 = b_0 + b_1 X_i \]

Substitute in \[Y_i = a_0 + a_1 X_i + (b_0 + b_1 X_i) D_i \] and rearrange \[Y_i = a_0 + a_1 X_i + b_0 D_i + b_1 X_i D_i\]

(Same result if we make \(a_1\) depend on \(D_i\).)
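In R’s formula syntax, x*d expands to exactly this: both main effects plus the product term. A sketch with simulated data (names are illustrative):

```r
# x*d is shorthand for x + d + x:d, matching the rearranged equation above
set.seed(60637)
n <- 500
x <- rnorm(n)
d <- rbinom(n, 1, 0.5)
y <- 1 + 0.5 * x + (2 + 1.5 * x) * d + rnorm(n)  # effect of d varies with x

coef(lm(y ~ x * d))        # (Intercept), x, d, x:d
coef(lm(y ~ x + d + x:d))  # same four coefficients
```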

Adding an interaction

lm(reply_atall ~ treat_deshawn*leg_R, data = bb) |> coef()
        (Intercept)       treat_deshawn               leg_R treat_deshawn:leg_R 
         0.52976190          0.01596300          0.09949293         -0.07550407 

If we write the prediction equation as \[\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 \mathrm{DeShawn}_i + \hat{\beta}_2 \mathrm{LegRep}_i + \hat{\beta}_3 \mathrm{DeShawn}_i \times \mathrm{LegRep}_i\], then we have

Email        Recipient   Predicted response rate
Jake         Dem         \(\hat{\beta}_0\)
DeShawn      Dem         \(\hat{\beta}_0 + \hat{\beta}_1\)
Jake         Rep         \(\hat{\beta}_0 + \hat{\beta}_2\)
DeShawn      Rep         \(\hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3\)

Another way

You can arrange the regression to tell you means more directly:

bb$email_party <- paste(ifelse(bb$treat_deshawn == 1, "Deshawn", "Jake"),
                        ifelse(bb$leg_R == 0, "D", "R"),
                        sep = "_") 
table(bb$email_party)

Deshawn_D Deshawn_R    Jake_D    Jake_R 
     1345      1083      1344      1087 
lm(reply_atall ~ email_party, data = bb) |> coef()
         (Intercept) email_partyDeshawn_R    email_partyJake_D    email_partyJake_R 
          0.54572491           0.02398885          -0.01596300           0.08352992 
lm(reply_atall ~ email_party - 1, data = bb) |> coef()
email_partyDeshawn_D email_partyDeshawn_R    email_partyJake_D    email_partyJake_R 
           0.5457249            0.5697138            0.5297619            0.6292548 

Takeaways

  • you decide the form of the BLP: do you want an intercept and coefficients for the non-omitted groups? do you want means for each category?
  • if you have only categorical predictors and their interactions (saturated regression), OLS gives you sample means for each category (though you may have to combine coefficients to recover those sample means)

Inference for regression

Inference for a single coefficient

Inference procedure for a single coefficient \(\hat{\beta_1}\) is just like inference for a sample mean:

  • (estimated) standard error \(\hat{\sigma}[\hat{\beta_1}]\) comes from regression output
  • use \(\hat{\sigma}[\hat{\beta_1}]\) to produce confidence interval, get \(p\)-value (probability of coefficient further from null if population coefficient were zero) assuming sampling distribution approximately normal
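A sketch of this procedure on simulated data (illustrative; note confint() uses the t distribution, so it differs slightly from the normal approximation):

```r
# Normal-approximation 95% CI for a slope, compared with confint()
set.seed(60637)
x <- rnorm(200)
y <- 1 + 0.3 * x + rnorm(200)
fit <- lm(y ~ x)

est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]

c(est - 1.96 * se, est + 1.96 * se)  # approximate 95% CI
confint(fit)["x", ]                  # t-based 95% CI; very close at n = 200
```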

Which standard errors?

  • classical standard errors (default for lm()) assume regression errors
    • are independent: \(\text{Cov}[e_i, e_j] = 0\) for all \(i \neq j\),
    • have equal variance: \({\textrm V}[e_i \mid X_i] = {\textrm V}[e_i]\) (homoskedasticity)
  • robust standard errors (default for estimatr::lm_robust()) assume regression errors are independent, but may have different variances (heteroskedasticity)
  • clustered standard errors (clusters option for estimatr::lm_robust()) assume errors may be correlated within defined clusters
  • bootstrap standard errors: naive version makes same assumption as robust; can incorporate clustered sampling
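The robust (“sandwich”) idea can be sketched by hand. This computes HC0, the simplest version; estimatr::lm_robust() implements refinements of this (HC2 by default). Data simulated for illustration:

```r
# HC0 robust standard errors by hand vs. classical lm() standard errors
set.seed(60637)
n <- 300
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = abs(x))  # heteroskedastic: error SD grows with |x|

fit <- lm(y ~ x)
X <- cbind(1, x)
e <- residuals(fit)

bread <- solve(t(X) %*% X)
meat  <- t(X) %*% (X * e^2)  # same as t(X) %*% diag(e^2) %*% X, scales row i by e_i^2
se_hc0 <- sqrt(diag(bread %*% meat %*% bread))

rbind(classical = coef(summary(fit))[, "Std. Error"], robust_hc0 = se_hc0)
```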

Inference for combinations of coefficients

Often we are interested in linear combinations of coefficients, e.g. 

  • because a quantity of interest is estimated by the sum of coefficients: e.g. in interactive model above,
    • estimated reply rate for Democrats receiving Deshawn email is \(\hat{\beta}_0 + \hat{\beta}_1\)
    • estimated effect of Deshawn email for Republicans is \(\hat{\beta}_1 + \hat{\beta}_3\)
  • because we want to predict the outcome at specific values of the predictors: e.g. \(\hat{Y} = \hat{\beta}_0 + 2\hat{\beta}_1 + 3\hat{\beta}_2\) at \(X_{[1]} = 2\), \(X_{[2]} = 3\)

If we want a CI or \(p\)-value, we need the standard error of a (weighted) sum of coefficients.

Recall the variance rule, in general form: \[{\textrm V}[aX + bY + c] = a^2{\textrm V}[X] + b^2{\textrm V}[Y] + 2 a b \text{Cov}[X, Y]\]

Is e.g. \(\text{Cov}[\hat{\beta}_0, \hat{\beta}_1] = 0\) in \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 \text{Deshawn}_i\)?

Variance-covariance matrix

intx_reg <- lm(reply_atall ~ treat_deshawn*leg_R, data = bb)
vcov(intx_reg) |> round(5)
                    (Intercept) treat_deshawn    leg_R treat_deshawn:leg_R
(Intercept)             0.00018      -0.00018 -0.00018             0.00018
treat_deshawn          -0.00018       0.00036  0.00018            -0.00036
leg_R                  -0.00018       0.00018  0.00041            -0.00041
treat_deshawn:leg_R     0.00018      -0.00036 -0.00041             0.00081

Standard error of effect of Deshawn email for Republicans:

coef_names <- c("treat_deshawn", "treat_deshawn:leg_R")
(vcov_part <- vcov(intx_reg)[coef_names, coef_names])
                    treat_deshawn treat_deshawn:leg_R
treat_deshawn        0.0003637887       -0.0003637887
treat_deshawn:leg_R -0.0003637887        0.0008145863
(sigma_hat_R <- sqrt(sum(vcov_part))) # V(X) + V(Y) + 2*Cov(X, Y)
[1] 0.021232

Presenting confidence intervals

Polynomials and overfitting

A dataset

Here is a dataset on U.S. presidential elections from 1948 to 2016:

Variable Name   Description
year            Election year
deminc          1 = incumbent is a Democrat
incvote         Percent of the two-party vote for the incumbent party
q2gdp           Second-quarter change in real GDP in election year
juneapp         Net approval of incumbent president in June of election year

Incumbent approval rating and incumbent party vote share

OLS regression

lm(incvote ~ juneapp, data = pres)
Call:
lm(formula = incvote ~ juneapp, data = pres)

Coefficients:
(Intercept)      juneapp  
    51.0368       0.1649  

Avoid causal language in describing the coefficient on juneapp. Say e.g.

“One percentage point higher presidential approval in June is associated with .16 percentage points higher vote share for the president’s party in the November election”

Adding polynomials three ways (1)

# add variables to data
# base R version
pres2 <- pres
pres2$juneapp_squared <- pres2$juneapp^2
pres2$juneapp_cubed <- pres2$juneapp^3
# "tidy" version
pres2 <- pres |> 
  mutate(juneapp_squared = juneapp^2,
         juneapp_cubed = juneapp^3)
# regression
lm(incvote ~ juneapp + juneapp_squared + juneapp_cubed, data = pres2) 

Call:
lm(formula = incvote ~ juneapp + juneapp_squared + juneapp_cubed, 
    data = pres2)

Coefficients:
    (Intercept)          juneapp  juneapp_squared    juneapp_cubed  
      5.097e+01        2.165e-01        3.833e-04       -2.811e-05  

Adding polynomials three ways (2)

# add variables to formula using I()
lm(incvote ~ juneapp + I(juneapp^2) + I(juneapp^3), data = pres)

Call:
lm(formula = incvote ~ juneapp + I(juneapp^2) + I(juneapp^3), 
    data = pres)

Coefficients:
 (Intercept)       juneapp  I(juneapp^2)  I(juneapp^3)  
   5.097e+01     2.165e-01     3.833e-04    -2.811e-05  

Adding polynomials three ways (3)

# poly() function
lm(incvote ~ poly(juneapp, degree = 3, raw = T), data = pres) |> summary()

Call:
lm(formula = incvote ~ poly(juneapp, degree = 3, raw = T), data = pres)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1491 -1.4341  0.3028  1.5728  5.9915 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          5.097e+01  1.063e+00  47.932  < 2e-16 ***
poly(juneapp, degree = 3, raw = T)1  2.165e-01  7.136e-02   3.034  0.00893 ** 
poly(juneapp, degree = 3, raw = T)2  3.833e-04  1.560e-03   0.246  0.80955    
poly(juneapp, degree = 3, raw = T)3 -2.811e-05  4.131e-05  -0.681  0.50725    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.33 on 14 degrees of freedom
Multiple R-squared:  0.6723,    Adjusted R-squared:  0.6021 
F-statistic: 9.576 on 3 and 14 DF,  p-value: 0.001076

Plotting polynomial relationships

vote_reg <- lm(incvote ~ poly(juneapp, degree = 3, raw = T), data = pres)
juneapps <- data.frame(juneapp = seq(min(pres$juneapp), max(pres$juneapp), length = 100))
predictions <- predict(vote_reg, newdata = juneapps)
plot(pres$juneapp, pres$incvote)
lines(juneapps$juneapp, predictions)

Plotting polynomial relationships

pres |>  
  ggplot(aes(x = juneapp, y = incvote)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = F)

Plotting polynomial relationships

pres |>  
  ggplot(aes(x = juneapp, y = incvote)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = F,  
              formula = y ~ poly(x, 3)) 

Plotting polynomial relationships

pres |>  
  ggplot(aes(x = juneapp, y = incvote)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = F,
              formula = y ~ poly(x, 12)) +
  coord_cartesian(ylim = c(40, 70))

Overfitting

As you add more polynomial terms (more generally, as you add predictors to the model), you reduce the mean squared residual in the sample. But you may be increasing the mean squared residual in the population due to overfitting.

Suppose the dots below are the population, and the red line is the CEF:

Overfitting (2)

We will take samples of size \(n\) from the population (size \(N\)) and estimate the BLP with different orders of polynomial \(k\):

\[Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \ldots + \beta_k X_i^k \]

As we increase \(k\), what will happen to mean squared residual in sample (i.e. \(\frac{1}{n} \sum (Y_i - \hat{Y}_i)^2\)) vs in population (i.e. \(\frac{1}{N} \sum (Y_i - \hat{Y}_i)^2\))?

How does this depend on sample size \(n\)?
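The simulation described above can be sketched as follows; the population, the stand-in CEF, and the sizes are invented for illustration:

```r
# In-sample vs. population MSE as polynomial degree k grows
set.seed(60637)
N <- 10000
n <- 25
pop <- data.frame(x = runif(N, -2, 2))
pop$y <- sin(2 * pop$x) + rnorm(N, sd = 0.5)  # stand-in nonlinear CEF plus noise
samp <- pop[sample(N, n), ]

degrees <- c(1, 3, 6, 10)
results <- sapply(degrees, function(k) {
  fit <- lm(y ~ poly(x, k), data = samp)  # orthogonal polynomials for stability
  c(sample     = mean(residuals(fit)^2),
    population = mean((pop$y - predict(fit, newdata = pop))^2))
})
colnames(results) <- paste0("k=", degrees)
round(results, 3)  # sample MSE is weakly decreasing in k (nested models)
```

We should expect the population MSE to eventually rise with \(k\) when \(n\) is small: the flexible fit chases noise in the sample.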

Overfitting (3)

Overfitting (4)

Takeaways

  • we can specify a BLP that is a very flexible function of a predictor \(X\) with polynomials, binning, splines
  • more flexibility \(\rightarrow\) better fit (lower mean squared residuals) in the sample
  • for small samples, more flexibility \(\rightarrow\) worse fit in the population due to overfitting